Code
library(tidyverse)
library(sf)
library(mapview)
#library(summarytools)
library(GGally)
library(stargazer)
knitr::opts_chunk$set(echo = TRUE)Quinn He
November 21, 2022
Why did I choose particular interaction terms?
Why did I choose the variables I did for certain models? There must be a reason for each different model
Here is the Miami housing dataset I am using for the project.
Rows: 13932 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (17): LATITUDE, LONGITUDE, PARCELNO, SALE_PRC, LND_SQFOOT, TOT_LVG_AREA,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Below I read in data I may end up using that relates to the county and state election data. For now, my main data set is miami_housing.
Rows: 4633 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, county
dbl (3): current_votes, total_votes, percent
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 52 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): state
dbl (1): total_votes
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 32177 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): state, county, candidate, party
dbl (1): total_votes
lgl (1): won
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dataset contains information on 13,932 single-family homes sold in Miami .
Names before column rename: PARCELNO: unique identifier for each property. About 1% appear multiple times. SALE_PRC: sale price (\() LND_SQFOOT: land area (square feet) TOTLVGAREA: floor area (square feet) SPECFEATVAL: value of special features (e.g., swimming pools) (\)) RAIL_DIST: distance to the nearest rail line (an indicator of noise) (feet) OCEAN_DIST: distance to the ocean (feet) WATER_DIST: distance to the nearest body of water (feet) CNTR_DIST: distance to the Miami central business district (feet) SUBCNTR_DI: distance to the nearest subcenter (feet) HWY_DIST: distance to the nearest highway (an indicator of noise) (feet) age: age of the structure avno60plus: dummy variable for airplane noise exceeding an acceptable level structure_quality: quality of the structure month_sold: sale month in 2016 (1 = jan) LATITUDE LONGITUDE
I first want to rename some columns because I am not a fan of the format of the stock column names,
miami_housing <- miami_housing %>%
rename("latitude" = "LATITUDE",
"longitude" = "LONGITUDE",
"sale_price" = "SALE_PRC",
"land_sqfoot" = "LND_SQFOOT",
"floor_sqfoot" = "TOT_LVG_AREA",
"special_features" = "SPEC_FEAT_VAL",
"dist_2_rail" = "RAIL_DIST",
"dist_2_ocean" = "OCEAN_DIST",
"dist_2_nearest_water" = "WATER_DIST",
"dist_2_biz_center" = "CNTR_DIST",
"dis_2_nearest_subcenter"= "SUBCNTR_DI",
"dist_2_hiway" = "HWY_DIST", #close distance is a negative trait
"home_age" = "age") I wonder if it would be worth creating another column called “county” based off of the longitude and latitude coordinates. This would make some graphs more interesting since I could fill by county in a ggplot graph.
Does the saying “location, location, location” really matter when it comes to housing prices in Miami or are there other factors as well that contribute to price?
Outcome variable: Sale price of houses in Miami
Explanatory variable: Distance from the ocean, measured in feet
Hypothesis: Houses closer (lower dist_2_ocean) will have a higher price than houses farther.
I’ll have to go further in depth into this research that I found.
According to Redfin, the median sale price for houses in Miami is $530,000. https://www.redfin.com/city/11458/FL/Miami/housing-market
https://www.proquest.com/docview/222418064?pq-origsite=gscholar&fromopenview=true - This study found that increase in house prices from 2003-2004 was largely due to interest rates, housing construction, unemployment, and household income.
https://link.springer.com/article/10.1023/A:1007751112669 - uses OLS to predict house prices - The problem with the above procedure has to do with neighborhood effects. Every realtor knows that location is an extremely important determinant of the price of a house, and yet incorporating neighborhood effects into an ordinary regression model is problematic, for two reasons: (1) neighborhood quality is an unobservable variable, and (2) the measurement of neighborhood quality ( presuming that it can be measured) requires the knowledge of neighborhood boundaries. The ®rst issue can be resolved by using proxies for neighborhood quality. Variables such as crime rates, school quality measures (such as scores on standardized tests),3 and socioeconomic characteristics (such as race and income) have all been used as proxies for neighborhood quality. - The data used in the analysis come from 1978 multiple listings for Baltimore, Maryland. - The original data set consists of 2157 house sales. Of these, 664 records contained missing values, leaving 1493 valid observations. The valid records were randomly divided into two groupsÐan estimation sample and a prediction sample. The estimation sample contains 1000 observations, and the prediction sample contains 493 observations. -
https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1475-4932.2005.00243.x?casa_token=Lk-oinHQfCoAAAAA:3YFA9GRG6r9UwIJa8-8Z40x77ENRHm07d9CQ7dLdRZD5VoGjSVB4hUUcKU8yWiJapg0csSONiwxZzqWJ
https://www.sciencedirect.com/science/article/pii/S1051137712000228?casa_token=DwJ93qbzqLAAAAAA:8vClfNoxs09D_UL4BCg-Ds36vurfjk8t0uK6Hft50ytFeWo3-XaFlsj5r0WBW4lB1jGqgoaqwXA
https://link.springer.com/article/10.1007/s11146-007-9053-7
https://eprints.gla.ac.uk/221903/1/221903.pdf
spec_tbl_df [13,932 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ latitude : num [1:13932] 25.9 25.9 25.9 25.9 25.9 ...
$ longitude : num [1:13932] -80.2 -80.2 -80.2 -80.2 -80.2 ...
$ PARCELNO : num [1:13932] 6.22e+11 6.22e+11 6.22e+11 6.22e+11 6.22e+11 ...
$ sale_price : num [1:13932] 440000 349000 800000 988000 755000 630000 1020000 850000 250000 1220000 ...
$ land_sqfoot : num [1:13932] 9375 9375 9375 12450 12800 ...
$ floor_sqfoot : num [1:13932] 1753 1715 2276 2058 1684 ...
$ special_features : num [1:13932] 0 0 49206 10033 16681 ...
$ dist_2_rail : num [1:13932] 2816 4359 4413 4585 4063 ...
$ dist_2_ocean : num [1:13932] 12811 10648 10574 10156 10837 ...
$ dist_2_nearest_water : num [1:13932] 348 338 297 0 327 ...
$ dist_2_biz_center : num [1:13932] 42815 43505 43530 43798 43600 ...
$ dis_2_nearest_subcenter: num [1:13932] 37742 37340 37329 37423 37551 ...
$ dist_2_hiway : num [1:13932] 15955 18125 18200 18514 17903 ...
$ home_age : num [1:13932] 67 63 61 63 42 41 63 21 56 63 ...
$ avno60plus : num [1:13932] 0 0 0 0 0 0 0 0 0 0 ...
$ month_sold : num [1:13932] 8 9 2 9 7 2 2 9 3 11 ...
$ structure_quality : num [1:13932] 4 4 4 4 4 4 5 4 4 5 ...
- attr(*, "spec")=
.. cols(
.. LATITUDE = col_double(),
.. LONGITUDE = col_double(),
.. PARCELNO = col_double(),
.. SALE_PRC = col_double(),
.. LND_SQFOOT = col_double(),
.. TOT_LVG_AREA = col_double(),
.. SPEC_FEAT_VAL = col_double(),
.. RAIL_DIST = col_double(),
.. OCEAN_DIST = col_double(),
.. WATER_DIST = col_double(),
.. CNTR_DIST = col_double(),
.. SUBCNTR_DI = col_double(),
.. HWY_DIST = col_double(),
.. age = col_double(),
.. avno60plus = col_double(),
.. month_sold = col_double(),
.. structure_quality = col_double()
.. )
- attr(*, "problems")=<externalptr>
latitude longitude PARCELNO sale_price
Min. :25.43 Min. :-80.54 Min. :1.020e+11 Min. : 72000
1st Qu.:25.62 1st Qu.:-80.40 1st Qu.:1.079e+12 1st Qu.: 235000
Median :25.73 Median :-80.34 Median :3.040e+12 Median : 310000
Mean :25.73 Mean :-80.33 Mean :2.356e+12 Mean : 399942
3rd Qu.:25.85 3rd Qu.:-80.26 3rd Qu.:3.060e+12 3rd Qu.: 428000
Max. :25.97 Max. :-80.12 Max. :3.660e+12 Max. :2650000
land_sqfoot floor_sqfoot special_features dist_2_rail
Min. : 1248 Min. : 854 Min. : 0 Min. : 10.5
1st Qu.: 5400 1st Qu.:1470 1st Qu.: 810 1st Qu.: 3299.4
Median : 7500 Median :1878 Median : 2766 Median : 7106.3
Mean : 8621 Mean :2058 Mean : 9562 Mean : 8348.5
3rd Qu.: 9126 3rd Qu.:2471 3rd Qu.: 12352 3rd Qu.:12102.6
Max. :57064 Max. :6287 Max. :175020 Max. :29621.5
dist_2_ocean dist_2_nearest_water dist_2_biz_center
Min. : 236.1 Min. : 0 Min. : 3826
1st Qu.:18079.3 1st Qu.: 2676 1st Qu.: 42823
Median :28541.8 Median : 6923 Median : 65852
Mean :31691.0 Mean :11960 Mean : 68490
3rd Qu.:44310.7 3rd Qu.:19200 3rd Qu.: 89358
Max. :75744.9 Max. :50400 Max. :159976
dis_2_nearest_subcenter dist_2_hiway home_age avno60plus
Min. : 1463 Min. : 90.2 Min. : 0.00 Min. :0.00000
1st Qu.: 23996 1st Qu.: 2998.1 1st Qu.:14.00 1st Qu.:0.00000
Median : 41110 Median : 6159.8 Median :26.00 Median :0.00000
Mean : 41115 Mean : 7723.8 Mean :30.67 Mean :0.01493
3rd Qu.: 53949 3rd Qu.:10854.2 3rd Qu.:46.00 3rd Qu.:0.00000
Max. :110554 Max. :48167.3 Max. :96.00 Max. :1.00000
month_sold structure_quality
Min. : 1.000 Min. :1.000
1st Qu.: 4.000 1st Qu.:2.000
Median : 7.000 Median :4.000
Mean : 6.656 Mean :3.514
3rd Qu.: 9.000 3rd Qu.:4.000
Max. :12.000 Max. :5.000
At a glance, I do not care about summary statistics for longitude and latitude. Sale price is worth noting with a minimum of $72,000 and a max of $2.6 million. There are some pretty old homes in the data set with the most at almost 100 years old. A minimum age of zero may be something to watch out for because it is not clear as to if the home is less than 1 year old. I must also find out how the distance is measured. I believe distance in this data set is measured in feet so we can see there may be a few water front properties that I will guess fetch a high sale price. Homes farther away may be cheaper if they are not located near another desirable location.
While typically a negative, the distance to highway can be viewed as a positive trait to a home. For the sake of this analysis and to go with typical association with close proximity to a highway, I will view closer distance as a negative.
Another note, is I will have one linear model, m1, to act as the model with all of the variables in it. I want it as a baseline model to see the factor all of the variables play in house price. Since multicollinearity may be a factor, I will not use this model too much.
I want to run a correlation matrix for some of the variables so I can see if there will be issues of multicollinearity. This is checking for multicollinearity for the distance_to variables
Thankfully, it looks like I only have to keep note of a few variables that look to be highly correlated with one another. Distance to business center and subcenter must be in the same area, or at least close enough because 0.766 seems a bit too correlated. Ocean distance and water distance are worth noting, but not to the same degree as the prior two variables since 0.49 is just below what I would consider significant.
By running a few linear models, I am getting closer to the conclusion that there is no one inidicator of sale price, but multiple. Multiple linear regression models will have to be run in order to make a sound conclusion as to what is the most significant indicator of price for single family homes.
Call:
lm(formula = sale_price ~ longitude * latitude, data = miami_housing)
Residuals:
Min 1Q Median 3Q Max
-471277 -166378 -60079 73821 2365411
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.850e+10 4.998e+08 37.02 <2e-16 ***
longitude 2.301e+08 6.220e+06 37.00 <2e-16 ***
latitude -7.139e+08 1.939e+07 -36.83 <2e-16 ***
longitude:latitude -8.880e+06 2.413e+05 -36.81 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 294200 on 13928 degrees of freedom
Multiple R-squared: 0.1399, Adjusted R-squared: 0.1397
F-statistic: 755 on 3 and 13928 DF, p-value: < 2.2e-16
According to this model with longitude and latitude as interaction terms, they alone are not too significant in the prediction of house prices.
=============================================================================
Dependent variable:
--------------------------------------------
sale_price
(1) (2) (3)
-----------------------------------------------------------------------------
dist_2_ocean -4.952***
(0.147)
home_age -1,850.640***
(126.087)
land_sqfoot 18.974***
(0.413)
Constant 556,875.800*** 456,699.700*** 236,370.100***
(5,323.914) (4,697.537) (4,349.770)
-----------------------------------------------------------------------------
Observations 13,932 13,932 13,932
R2 0.075 0.015 0.132
Adjusted R2 0.075 0.015 0.132
Residual Std. Error (df = 13930) 305,024.700 314,801.200 295,578.300
F Statistic (df = 1; 13930) 1,136.729*** 215.427*** 2,115.150***
=============================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
I don’t love looking at the notation for sale price, but for now it will do. This visualization of price and distance to ocean is about how I expected it would go. We can see homes closer to the ocean go for the most amount of money, but at the ~40,000 and ~60,000ft distance, there is a spike. This I would like to figure out.
`geom_smooth()` using formula 'y ~ x'

I use mapview here just to get a birds-eye view of the entire Miami area.